A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Aythya marila is one of the few species of Anatidae, and the only Aythya to live in the circumpolar. However, there is a relative lack of research on genetics of this species. In this study, we reported and assembled the first high-quality chromosome-level genome assembly of A. marila. This genome was assembled using Nanopore long reads, and errors corrected using Illumina short reads, with a final genome size of 1.14 Gb, scaffold N50 of 85.44 Mb, and contig N50 of 32.46 Mb. 106 contigs were clustered and ordered onto 35 chromosomes based on Hi-C data, covering approximately 98.28% of the genome. BUSCO assessment showed that 97.0% of the highly conserved genes in aves_odb10 were present intact in the genome assembly. In addition, a total of 154.94 Mb of repetitive sequences were identified. 15,953 protein-coding genes were predicted in the genome, and 98.96% of genes were functionally annotated. This genome will be a valuable resource for future genetic diversity and genomics studies of A. marila.

(2023) 10:254 | https://doi.org/10.1038/s41597-023-02142-x www.nature.com/scientificdata www.nature.com/scientificdata/ Using Illumina ® TruSeq ® Nano DNA Library Prep kits to generate sequencing libraries of genomic DNA, we constructed a paired-end library with an insertion size of 350 bp, which was sequenced using Illumina HiSeq 4000 platform. Finally, a total of 60.77 Gb data (coverage of 45.30×) of 150 bp paired-end reads were generated. Simultaneously, using the PromethION platform with single-molecule real-time sequencing to sequence long reads. 122.55 Gb data were obtained, which covered 91.36 fold of the greater scaup's genome. To assemble a chromosome-level genome of A. marila, a high-through chrome configuration capture (Hi-C) library was built and sequenced using Illumina PE150 platform. In total, we obtained 63.43 Gb raw reads. The six transcriptomic samples were used for RNA and sequenced using Illumina NovaSeq 6000 Platform, 43.9 Gb raw reads of all samples were obtained and used for subsequent gene prediction.
Genome size estimation and assembly. To estimate the genome size, heterozygosity and the proportion of repetitive sequences in A. marila, we performed K-mer analysis using Illumina sequencing data by jellyfish (v2.2.7). While K = 17, the estimated genome size was 1,341.4 Mb, the heterozygosity was 0.47%, and the proportion of repetitive sequences was 42.28% (Table 1). Then, we used NextDenovo (v2.4.0) (https://github. com/Nextomics/NextDenovo) software to assemble the genome with Oxford nanopore technologies (ONT) long reads, and NextPolish 12 (v1.3.1) was used to increase the precision of single base with Illumina short reads. We obtained a preliminarily assembly with size of 1,135.19 Mb after de novo assembly and base correction, which included 194 contigs, contig N50 was 34.42 Mb, and the proportion of GC was 41.94%.
ALLHiC 13 (v0.9.8) and Juicebox (v1.11.08) software were used to mount the contigs in preliminarily assembly onto chromosomes. Contigs were sorted, pruned, and optimized based on the signal strength after Hi-C data were utilized to identify the interaction signals between Contigs. The interaction signals were then shown as a heat map using ALLHiC (Fig. 1). Using Juicebox to correct the position of contigs according to the intensity of chromosome interaction, and finally get a chromosome-level genome assembly with scaffold N50 of 85.44 Mb (Table 2). A total of 106 contigs were mapped to 35 chromosomes with lengths ranging from 2.11 Mb to 207.93 Mb, and 98.28% of the sequences were mapped to the chromosomes ( Table 3). assessment of the genome assemblies. To assess the accuracy of genome assembly, we used Burrows-Wheeler aligner 14 (BWA) (v0.7.8) to map Illumina reads to the genome of A. marila. The reads matching rate was approximately 98.80%, indicating a good agreement between the reads and the assembled genome. Merqury 15 (v1.3) was ran to evaluate assembly quality value (QV), and a high QV (42.14) indicates that this assembly is of good quality. Benchmarking Universal Single-Copy Orthologs 16 (BUSCO) (v5.4.4) (use option "--augustus") and Core Eukaryotic Genes Mapping Approach 17 (CEGMA) (v2.5) were also used to assess the  www.nature.com/scientificdata www.nature.com/scientificdata/ integrity of the assembled genome. The BUSCO results showed that 97.0% of the complete BUSCOs and 0.8% of the fragmented BUSCOs were found in 8338 single-copy orthologs of avers_odb10, and 2.2% of BUSCOs was missing (Fig. 2). Moreover, 238 of 248 core eukaryotic genes were detected using CEGMA. The above results indicate that the genome assembly of A. marila was complete and of high quality.    www.nature.com/scientificdata www.nature.com/scientificdata/ Comparison with tufted duck genome assembly. Mummer 18 (v4.0.0) was used to identify the synteny between A. marila and tufted duck 19 (Aythya fuligula) genomes to determine orthologous chromosome pairs, and we used TBtools 20 (v1.112) to draw the synteny between their chromosomes. The synteny analysis shows that the chromosome sequences of the two species have good synteny, and each chromosome corresponds to multiple chromosomes of another species. There is one sex chromosome Z (Chr4) in A. marila, and it's multiple autosome have some synteny with the sex chromosome W of A. fuligula (Fig. 3). Most of the chromosomes can correspond to the chromosome of the tufted duck, however both Chr31 and Chr32 shared synteny with chromosome 34 of the tufted duck, and the synteny was poor (Fig. 3d). Simultaneously, they also have synteny with some contigs. This may be due to the poor assembly quality caused by too small chromosomes size (Chr31, 2.05 Mb; Chr32, 1.03 Mb). In gap analysis, the chromosomes of A. fuligula genome assembly had a higher proportion of N bases (Fig. 3c). In general, the genome we assembled was of high quality. annotation of repetitive sequences. We used a combined approach of de novo prediction and homologybased prediction to annotate repetitive sequences in the genome of A. marila. For de novo prediction, we used Tandem Repeats Finder 21 (TRF) (v4.09) to detect tandem repeat sequences in the genome. RepeatModeler (v2.0.3), RepeatScout 22 (v1.0.6) and RECON (v1.08) were used to build a database of transposable element (TE) family for de nove prediction. RepeatProteinMask and RepeatMasker (v4.1.2-p1) were used for homology prediction with Repbase database 23 and Dfam database 24 , the species parameter used was chicken. Ultimately, we identified 154.94 Mb of repetitive sequences by all methods, accounting for 13.65% of the assembled genome (Table 4).
Gene structure prediction. We used three methods to predict the protein-coding sequences in the genome of A. marila, ab initio prediction, homology-based prediction and RNA-Seq prediction. Three ab initio prediction software were used in the study, Augustus 25 (v3.3.2), GlimmerHMM 26 (v3.0.4), and Geneid 27 (v1.4.5). For the six transcriptomic samples, the raw data were quality controlled and filtered out bad reads using fastp 28 (v0.23.1) to obtain clean data. Then, the paired-end reads are assembled using SPAdes 29 (v3.15.3). Next, the candidate coding   www.nature.com/scientificdata www.nature.com/scientificdata/ regions in the transcript sequences are identified using TransDecoder 30 (v5.5.0), and the sequences were clustered using CD-hit 31 (v4.8.1) to remove redundant sequences. Eventually, the obtained protein sequences were used for subsequent prediction along with other downloaded sequences of related species. Genomes and annotation files of six related species 19,32-36 (Anser cygnoides, Anas platyrhynchos, Aythya fuligula, Cygnus olor, Cygnus atratus, Gallus gallus) were downloaded from NCBI database to extract the longest transcript and translated them into protein sequences. Their protein sequences were matched to A. marila genome using Spaln 37 (v2.4.6), and then the matched sequences were accurately spliced against the homologous protein-coding sequences using GeneWise 38 (v2.4.1) software. Homology-based prediction results, RNA-Seq prediction results and ab initio prediction results are integrated using EvidenceModeler 39 (EVM) (v1.1.1) to generate gene sets, and including masked repeats as input to the gene predictions. There were 19,257 genes after EVM integration. The BUSCO results showed that complete BUSCOs was 88.2%, fragmented BUSCOs was 1.5%, and missing BUSCOs was 10.3%, which was lower than the BUSCO results of genome (Fig. 4a). Thus, for some single-copy orthologs of aves_odb10, that ware complete in genome but missing or fragmented in predictive gene set, their annotation file were integrated into the gene set. Overall, a total of 19,533 protein-coding genes were predicted in A. marila genome, including 159 prematurely terminated genes and 273 partial genes. The average transcript length of this gene set was 23,126.79 bp, and the average coding sequence (CDS) length was 1,558.71 bp ( Table 5). The BUSCO results showed 95.5% of the complete BUSCOs, 0.3% of the fragmented BUSCOs and 4.2% of the missing BUSCOs (Fig. 4b).
Gene function annotation. Next, function annotation of predicted genes for A. marila using diamond 40 against SwissProt 41     Filtering and verification of gene set. We filtered the predicted gene set, because the number of predicted genes was more than in the six related species, and more than 3,000 genes could not be function annotated. We used OrthoFinder to find the orthologs of A. marila and six related species. There were 4,086 unassigned genes, and 3,421 of them were not annotated in any database ( Table 7). Most of these genes (3,417/3,421) were predicted using at least one of the de novo prediction software, and only four genes are supported by other evidence. The BUSCO test was performed after removed unassigned genes without function annotations, the results showed no effect compared with before removed (Table 8; Fig. 4b). These suggests that these genes may only have specific sequences and patterns, rather than real genes 46 . After filtering the unassigned genes without annotation and 159 prematurely terminated genes, 15,953 genes remain, including 182 partial genes, and 98.96% of them were annotated. The filtered gene set of A. marila was compared with the longest transcript set of related species, it has similar structure (Table 9 and Fig. 5). Except G. gallu, the average transcript length, average CDS length, average exons number, and average intron length of A. marila were smaller than those of others. Simultaneously,    www.nature.com/scientificdata www.nature.com/scientificdata/ the average exon length of A. Marila was the longest among all species. However, these differences were not significant. This may be due to A. Marila has a higher proportion of short genes (Table 5a).

Data records
All sequencing data had been uploaded to NCBI database. The genomic Illumina sequencing data, Nanopore sequencing data and Hi-C sequencing data were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR21672225 47 , SRR21672223 48 , and SRR21672224 49 ). The transcriptome data of 6 samples were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR21700073 -SRR21700078 [50][51][52][53][54][55] ). The genome assembly of A. marila were deposited at GenBank database under the accession JAOQIG000000000 56 . The annotation results of repeats sequences, gene structure and functional prediction were deposited at the Figshare database 57 .   Table 10. Statistical information of Illumina, Hi-C and RNA sequencing data.